candy_file <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv"
candy <- read.csv(candy_file, row.names=1)
head(candy)
## chocolate fruity caramel peanutyalmondy nougat crispedricewafer
## 100 Grand 1 0 1 0 0 1
## 3 Musketeers 1 0 0 0 1 0
## One dime 0 0 0 0 0 0
## One quarter 0 0 0 0 0 0
## Air Heads 0 1 0 0 0 0
## Almond Joy 1 0 0 1 0 0
## hard bar pluribus sugarpercent pricepercent winpercent
## 100 Grand 0 1 0 0.732 0.860 66.97173
## 3 Musketeers 0 1 0 0.604 0.511 67.60294
## One dime 0 0 0 0.011 0.116 32.26109
## One quarter 0 0 0 0.011 0.511 46.11650
## Air Heads 0 0 0 0.906 0.511 52.34146
## Almond Joy 0 1 0 0.465 0.767 50.34755
Q1. How many different candy types are in this dataset?
nrow(candy)
## [1] 85
A1. There are 85 different candy types in this data set.
Q2. How many fruity candy types are in the dataset?
sum(candy$fruity)
## [1] 38
A2. There are 38 fruity candy types in the dataset.
candy["Twix",]$winpercent
## [1] 81.64291
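Name-based lookups like the one above work because `row.names=1` turns the first CSV column into row names. A minimal sketch, assuming a two-row inline CSV in place of the real file (the `toy_csv` text and the `competitorname` header are illustrative assumptions; the values are copied from the `head()` output):

```r
# Inline CSV standing in for the real file (assumed header; values from the output above).
toy_csv <- "competitorname,chocolate,winpercent
100 Grand,1,66.97173
Air Heads,0,52.34146"

# row.names = 1 uses the first column as row names rather than as a data column.
toy <- read.csv(text = toy_csv, row.names = 1)

# Rows can now be selected by candy name.
air_heads_win <- toy["Air Heads", "winpercent"]
print(air_heads_win)
```

Without `row.names=1` the names would be an ordinary column and `toy["Air Heads", ]` would return a row of `NA` values.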
Q3. What is your favorite candy in the dataset and what is its winpercent value?
My favorite candy in the data set is Peanut butter M&M’s.
candy["Peanut butter M&MÕs",]$winpercent
## [1] 71.46505
Its win percent is approximately 71.5%.
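Some row names in this dataset carry a stray `Õ` where an apostrophe belongs, which is why the lookup above has to spell the name that way. One optional way to clean this up is sketched below on a toy data frame (the `toy` object is an assumption for illustration; the two values are copied from the outputs in this report):

```r
# Toy data frame with the same garbled apostrophe ("\u00D5" is the character Õ).
toy <- data.frame(
  winpercent = c(71.46505, 84.18029),
  row.names  = c("Peanut butter M&M\u00D5s", "Reese\u00D5s Peanut Butter cup")
)

# Replace the stray character with a plain apostrophe in every row name.
rownames(toy) <- gsub("\u00D5", "'", rownames(toy), fixed = TRUE)

# After the cleanup, lookups must use the apostrophe spelling.
mm_win <- toy["Peanut butter M&M's", "winpercent"]
print(mm_win)
```

The rest of this report keeps the original names, so the `Õ` spelling is still required in the code shown here.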
Q4. What is the winpercent value for “Kit Kat”?
candy["Kit Kat",]$winpercent
## [1] 76.7686
The win percent value for Kit Kat is approximately 76.8%.
Q5. What is the winpercent value for “Tootsie Roll Snack Bars”?
candy["Tootsie Roll Snack Bars",]$winpercent
## [1] 49.6535
The win percent value for Tootsie Roll Snack Bars is approximately 49.7%.
Load the skimr package to get an overview of the dataset.
library("skimr")
skim(candy)
| Name | candy |
| Number of rows | 85 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| chocolate | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| fruity | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| caramel | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| peanutyalmondy | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| nougat | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| crispedricewafer | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| hard | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| bar | 0 | 1 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| pluribus | 0 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
| sugarpercent | 0 | 1 | 0.48 | 0.28 | 0.01 | 0.22 | 0.47 | 0.73 | 0.99 | ▇▇▇▇▆ |
| pricepercent | 0 | 1 | 0.47 | 0.29 | 0.01 | 0.26 | 0.47 | 0.65 | 0.98 | ▇▇▇▇▆ |
| winpercent | 0 | 1 | 50.32 | 14.71 | 22.45 | 39.14 | 47.83 | 59.86 | 84.18 | ▃▇▆▅▂ |
Q6. Is there any variable/column that looks to be on a different scale to the majority of the other columns in the dataset?
A6. Yes. Most columns hold values between 0 and 1, while winpercent is on a 0 to 100 scale (observed values run from 22.45 to 84.18).
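A quick way to spot a column on a different scale without skimr is to compare per-column ranges. A small sketch, assuming a three-row toy subset built from the `head()` output (the `toy`, `col_ranges`, and `off_scale` names are illustrative):

```r
# Three rows and three columns copied from the head() output above.
toy <- data.frame(
  chocolate    = c(1, 1, 0),
  sugarpercent = c(0.732, 0.604, 0.906),
  winpercent   = c(66.97173, 67.60294, 52.34146),
  row.names    = c("100 Grand", "3 Musketeers", "Air Heads")
)

# Minimum (row 1) and maximum (row 2) of every column.
col_ranges <- sapply(toy, range)
print(col_ranges)

# Columns whose maximum exceeds 1 are not on the 0-1 scale.
off_scale <- names(which(col_ranges[2, ] > 1))
print(off_scale)
```

This difference in scale is the reason the PCA later in the report is run on scaled data.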
Q7. What do you think a zero and one represent for the candy$chocolate column?
A7. In the candy$chocolate column, a zero represents that the candy is not chocolate-y, while a one represents that the candy is chocolate-y.
Q8. Plot a histogram of winpercent values
hist(candy$winpercent)
library(ggplot2)
ggplot(candy, aes(x=winpercent)) +
  geom_histogram(bins=10)
Q9. Is the distribution of winpercent values symmetrical?
A9. No. The distribution is slightly right-skewed: the mean (50.32) sits above the median (47.83).
Q10. Is the center of the distribution above or below 50%?
A10. Below. The median winpercent is 47.83, which is under 50%.
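A rough numeric check for symmetry is to compare the mean with the median: a mean noticeably above the median hints at a longer right tail. The sketch below uses a made-up vector `x`, not the candy data, purely to illustrate the comparison:

```r
# Hypothetical win percentages with a few high values pulling the mean up.
x <- c(30, 35, 40, 42, 45, 48, 50, 60, 75, 85)

x_mean   <- mean(x)
x_median <- median(x)

# A mean above the median suggests a right-skewed distribution.
right_skew_hint <- x_mean > x_median
print(c(mean = x_mean, median = x_median))
print(right_skew_hint)
```

This is only a heuristic; the histogram remains the more direct way to judge the shape.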
Q11. On average is chocolate candy higher or lower ranked than fruit candy?
First, we need to find all the chocolate candy rows in the ‘candy’ dataset.
chocolate <- as.logical(candy$chocolate)
chocolate_win <- candy[chocolate,]$winpercent
chocolate_win
## [1] 66.97173 67.60294 50.34755 56.91455 38.97504 55.37545 62.28448 56.49050
## [9] 59.23612 57.21925 76.76860 71.46505 66.57458 55.06407 73.09956 60.80070
## [17] 64.35334 47.82975 54.52645 70.73564 66.47068 69.48379 81.86626 84.18029
## [25] 73.43499 72.88790 65.71629 34.72200 37.88719 76.67378 59.52925 48.98265
## [33] 43.06890 45.73675 49.65350 81.64291 49.52411
Do the same for fruity candy:
fruity <- as.logical(candy$fruity)
fruity_win <- candy[fruity,]$winpercent
fruity_win
## [1] 52.34146 34.51768 36.01763 24.52499 42.27208 39.46056 43.08892 39.18550
## [9] 46.78335 57.11974 51.41243 42.17877 28.12744 41.38956 39.14106 52.91139
## [17] 46.41172 55.35405 22.44534 39.44680 41.26551 37.34852 35.29076 42.84914
## [25] 63.08514 55.10370 45.99583 59.86400 52.82595 67.03763 34.57899 27.30386
## [33] 54.86111 48.98265 47.17323 45.46628 39.01190 44.37552
Compare chocolate vs. fruity candy.
# Average chocolate win percent...
chocolate_win_avg <- mean(chocolate_win)
chocolate_win_avg
## [1] 60.92153
# Average fruity win percent...
fruity_win_avg <- mean(fruity_win)
fruity_win_avg
## [1] 44.11974
# Is average chocolate win percent greater than the average fruity win percent?
chocolate_win_avg > fruity_win_avg
## [1] TRUE
A11. On average, chocolate candy is ranked higher than fruity candy (mean winpercent of 60.9 vs. 44.1).
Q12. Is this difference statistically significant?
Use a two-sample t-test to compare the two groups.
t.test(chocolate_win,fruity_win)
##
## Welch Two Sample t-test
##
## data: chocolate_win and fruity_win
## t = 6.2582, df = 68.882, p-value = 2.871e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.44563 22.15795
## sample estimates:
## mean of x mean of y
## 60.92153 44.11974
A12. Yes. The t-test gives a p-value of 2.871e-08, far below the typical threshold of 0.05, so the difference is statistically significant.
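The Welch statistic reported by `t.test()` can be reproduced by hand, which makes it clearer what the test compares. The sketch below retypes the two vectors from the printed output (so they are rounded to the displayed digits) and uses illustrative names such as `t_manual` and `df_manual`:

```r
# Values retyped from the printed chocolate_win and fruity_win output.
chocolate_win <- c(
  66.97173, 67.60294, 50.34755, 56.91455, 38.97504, 55.37545, 62.28448, 56.49050,
  59.23612, 57.21925, 76.76860, 71.46505, 66.57458, 55.06407, 73.09956, 60.80070,
  64.35334, 47.82975, 54.52645, 70.73564, 66.47068, 69.48379, 81.86626, 84.18029,
  73.43499, 72.88790, 65.71629, 34.72200, 37.88719, 76.67378, 59.52925, 48.98265,
  43.06890, 45.73675, 49.65350, 81.64291, 49.52411)
fruity_win <- c(
  52.34146, 34.51768, 36.01763, 24.52499, 42.27208, 39.46056, 43.08892, 39.18550,
  46.78335, 57.11974, 51.41243, 42.17877, 28.12744, 41.38956, 39.14106, 52.91139,
  46.41172, 55.35405, 22.44534, 39.44680, 41.26551, 37.34852, 35.29076, 42.84914,
  63.08514, 55.10370, 45.99583, 59.86400, 52.82595, 67.03763, 34.57899, 27.30386,
  54.86111, 48.98265, 47.17323, 45.46628, 39.01190, 44.37552)

nx <- length(chocolate_win)
ny <- length(fruity_win)

# Squared standard error of each group mean.
sx2 <- var(chocolate_win) / nx
sy2 <- var(fruity_win) / ny

# Welch t statistic: difference in means over the combined standard error.
t_manual <- (mean(chocolate_win) - mean(fruity_win)) / sqrt(sx2 + sy2)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (sx2 + sy2)^2 / (sx2^2 / (nx - 1) + sy2^2 / (ny - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

res <- t.test(chocolate_win, fruity_win)
print(c(t = t_manual, df = df_manual, p = p_manual))
```

Because the groups have different sizes and variances, `t.test()` defaults to this unequal-variance (Welch) form rather than the pooled-variance Student test.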
Q13. What are the five least liked candy types in this set?
head(candy[order(candy$winpercent),], n=5)
## chocolate fruity caramel peanutyalmondy nougat
## Nik L Nip 0 1 0 0 0
## Boston Baked Beans 0 0 0 1 0
## Chiclets 0 1 0 0 0
## Super Bubble 0 1 0 0 0
## Jawbusters 0 1 0 0 0
## crispedricewafer hard bar pluribus sugarpercent pricepercent
## Nik L Nip 0 0 0 1 0.197 0.976
## Boston Baked Beans 0 0 0 1 0.313 0.511
## Chiclets 0 0 0 1 0.046 0.325
## Super Bubble 0 0 0 0 0.162 0.116
## Jawbusters 0 1 0 1 0.093 0.511
## winpercent
## Nik L Nip 22.44534
## Boston Baked Beans 23.41782
## Chiclets 24.52499
## Super Bubble 27.30386
## Jawbusters 28.12744
A13. Nik L Nip, Boston Baked Beans, Chiclets, Super Bubble, and Jawbusters are the least liked candy types (i.e. the candy types with the lowest winpercent).
Q14. What are the top 5 all time favorite candy types out of this set?
head(candy[order(-candy$winpercent),], n=5)
## chocolate fruity caramel peanutyalmondy nougat
## ReeseÕs Peanut Butter cup 1 0 0 1 0
## ReeseÕs Miniatures 1 0 0 1 0
## Twix 1 0 1 0 0
## Kit Kat 1 0 0 0 0
## Snickers 1 0 1 1 1
## crispedricewafer hard bar pluribus sugarpercent
## ReeseÕs Peanut Butter cup 0 0 0 0 0.720
## ReeseÕs Miniatures 0 0 0 0 0.034
## Twix 1 0 1 0 0.546
## Kit Kat 1 0 1 0 0.313
## Snickers 0 0 1 0 0.546
## pricepercent winpercent
## ReeseÕs Peanut Butter cup 0.651 84.18029
## ReeseÕs Miniatures 0.279 81.86626
## Twix 0.906 81.64291
## Kit Kat 0.511 76.76860
## Snickers 0.651 76.67378
A14. Reese’s Peanut Butter Cups, Reese’s Miniatures, Twix, Kit Kats, and Snickers are the most liked candy types (i.e. the candy types with the highest winpercent).
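Both rankings above rely on `order()`, which returns row positions rather than sorted values. A small sketch with four win percentages copied from the outputs above (the named vector `win` is an illustrative stand-in for `candy$winpercent`):

```r
# Named vector standing in for candy$winpercent.
win <- c(Twix = 81.64291, "Nik L Nip" = 22.44534,
         "Kit Kat" = 76.76860, Chiclets = 24.52499)

# order() gives the positions that would sort the vector in ascending order.
asc <- order(win)

# Negating the values (or decreasing = TRUE) reverses the ranking.
desc <- order(-win)

# Indexing with those positions yields the ranked names.
least_liked <- names(win)[asc]
most_liked  <- names(win)[desc]
print(least_liked)
print(most_liked)
```

Indexing a data frame with these positions, as in `candy[order(candy$winpercent), ]`, reorders whole rows in the same way.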
Q15. Make a first barplot of candy ranking based on winpercent values.
library(ggplot2)
ggplot(data=candy) +
aes(winpercent, rownames(candy)) +
geom_col()
Q16. This is quite ugly, use the reorder() function to get the bars sorted by winpercent?
ggplot(data=candy) +
aes(winpercent, reorder(rownames(candy), winpercent)) +
geom_col()
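The sorting happens because `reorder()` returns a factor whose levels follow the second argument, and ggplot2 lays out a discrete axis in level order. A minimal sketch with three candies (the `candy_names` and `win` vectors are illustrative; the values come from the outputs above):

```r
candy_names <- c("Twix", "Nik L Nip", "Kit Kat")
win         <- c(81.64291, 22.44534, 76.76860)

# A plain factor orders its levels alphabetically.
f_default <- factor(candy_names)
print(levels(f_default))

# reorder() orders the levels by the accompanying values, lowest first.
f_sorted <- reorder(candy_names, win)
print(levels(f_sorted))
```

With the levels sorted by winpercent, the lowest-ranked candy is drawn at the bottom of the y axis and the highest-ranked at the top.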
Make a color vector.
my_cols=rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] = "chocolate4"
my_cols[as.logical(candy$bar)] = "darkorchid4"
my_cols[as.logical(candy$fruity)] = "darkorange"
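Because the assignments run in sequence, a later one overwrites an earlier one wherever both conditions hold, so a chocolate bar ends up `darkorchid4`, not `chocolate4`. A sketch on four rows whose flags are copied from the outputs in this report (the `toy` data frame and `toy_cols` vector are illustrative):

```r
# Flags copied from the head() and top-5 outputs.
toy <- data.frame(
  chocolate = c(1, 1, 0, 0),
  bar       = c(1, 0, 0, 0),
  fruity    = c(0, 0, 1, 0),
  row.names = c("100 Grand", "Reese\u00D5s Miniatures", "Air Heads", "One dime")
)

# Same sequence of assignments as above, applied to the toy rows.
toy_cols <- rep("black", nrow(toy))
toy_cols[as.logical(toy$chocolate)] <- "chocolate4"
toy_cols[as.logical(toy$bar)]       <- "darkorchid4"  # overrides chocolate4 for bars
toy_cols[as.logical(toy$fruity)]    <- "darkorange"

names(toy_cols) <- rownames(toy)
print(toy_cols)
```

In other words, the colors mean: fruity (orange), bar (dark purple), non-bar chocolate (brown), everything else (black).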
Add color vector to the plot
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col(fill=my_cols) +
labs (title="Win Percent of Various Halloween Candies",
x=("Win Percent"), y=("Halloween Candy"))
Q17. What is the worst ranked chocolate candy?
Q18. What is the best ranked fruity candy?
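Besides reading the answers to Q17 and Q18 off the colored barplot, they can be computed by subsetting first and then taking the minimum or maximum. The sketch below runs on a six-row toy subset (values copied from the outputs above), so its results describe only that subset and are not the answers for the full dataset:

```r
# Toy subset; the real data has 85 rows.
toy <- data.frame(
  chocolate  = c(1, 1, 1, 0, 0, 0),
  fruity     = c(0, 0, 0, 1, 1, 1),
  winpercent = c(81.64291, 76.76860, 50.34755, 52.34146, 24.52499, 22.44534),
  row.names  = c("Twix", "Kit Kat", "Almond Joy", "Air Heads", "Chiclets", "Nik L Nip")
)

# Worst-ranked chocolate candy: keep chocolate rows, then find the minimum.
choc <- toy[toy$chocolate == 1, ]
worst_choc <- rownames(choc)[which.min(choc$winpercent)]

# Best-ranked fruity candy: keep fruity rows, then find the maximum.
fru <- toy[toy$fruity == 1, ]
best_fruity <- rownames(fru)[which.max(fru$winpercent)]

print(worst_choc)
print(best_fruity)
```

Swapping `toy` for `candy` applies the same logic to the full dataset.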
# How about a plot of price vs win
library(ggrepel)
ggplot(candy) +
aes(winpercent, pricepercent, label=rownames(candy)) +
geom_point(col=my_cols) +
geom_text_repel(col=my_cols, size=3.3, max.overlaps = 10) +
labs (title="Price Percent vs. Win Percent of Various Halloween Candies",
x=("Win Percent"), y=("Price Percent"))
## Warning: ggrepel: 21 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Q19. Which candy type is the highest ranked in terms of winpercent for the least money - i.e. offers the most bang for your buck?
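# Rank candies from most to least expensive, then show price and win percent for the top 5
ord <- order(candy$pricepercent, decreasing = TRUE)
head( candy[ord, c("pricepercent", "winpercent")], n=5 )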
## pricepercent winpercent
## Nik L Nip 0.976 22.44534
## Nestle Smarties 0.976 37.88719
## Ring pop 0.965 35.29076
## HersheyÕs Krackel 0.918 62.28448
## HersheyÕs Milk Chocolate 0.918 56.49050
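Q20. What are the top 5 most expensive candy types in the dataset and of these which is the least popular?
A20. From the output above, the five most expensive candy types are Nik L Nip, Nestle Smarties, Ring pop, Hershey’s Krackel, and Hershey’s Milk Chocolate. Of these, Nik L Nip is the least popular, with a winpercent of about 22.4%.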
library(corrplot)
## corrplot 0.90 loaded
cij <- cor(candy)
corrplot(cij)
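`corrplot()` draws the matrix returned by `cor()`, which holds the pairwise Pearson correlation between every pair of columns. A sketch on the six rows shown by `head()` earlier, restricted to three columns (the `toy` and `toy_cij` names are illustrative, and six rows are far too few to draw conclusions from):

```r
# Six rows copied from the head() output, three columns only.
toy <- data.frame(
  chocolate  = c(1, 1, 0, 0, 0, 1),
  fruity     = c(0, 0, 0, 0, 1, 0),
  winpercent = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755)
)

# Symmetric matrix with 1 on the diagonal and correlations elsewhere.
toy_cij <- cor(toy)
print(round(toy_cij, 2))

# A negative entry means the two variables tend not to occur together.
choc_fruity <- toy_cij["chocolate", "fruity"]
print(choc_fruity < 0)
```

In the plot, the sign of each entry sets the color of the circle and its magnitude sets the size.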
pca <- prcomp(candy, scale.=TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
## Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
## Cumulative Proportion 0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.74530 0.67824 0.62349 0.43974 0.39760
## Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
## Cumulative Proportion 0.89998 0.93832 0.97071 0.98683 1.00000
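The "Proportion of Variance" row follows directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total. The sketch below retypes the standard deviations from the summary (so results match only to the printed precision); `sdev` stands in for `pca$sdev`:

```r
# Standard deviations copied from the summary(pca) output above.
sdev <- c(2.0788, 1.1378, 1.1092, 1.07533, 0.9518, 0.81923,
          0.81530, 0.74530, 0.67824, 0.62349, 0.43974, 0.39760)

# Variance captured by each principal component.
pc_var <- sdev^2

# With scaled data the variances sum to the number of variables (12 here).
total_var <- sum(pc_var)

# Proportion and cumulative proportion of variance.
prop_var <- pc_var / total_var
cum_var  <- cumsum(prop_var)

print(round(prop_var[1:3], 4))
print(round(cum_var[1:3], 4))
```

Read this way, the summary says PC1 and PC2 together capture a little under half of the total variance.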
Now we can plot our main PCA score plot of PC1 vs PC2.
plot(pca$x[,1:2], col=my_cols, pch=16)
# Make a new data-frame with our PCA results and candy data
my_data <- cbind(candy, pca$x[,1:3])
p <- ggplot(my_data) +
aes(x=PC1, y=PC2,
size=winpercent/100,
text=rownames(my_data),
label=rownames(my_data)) +
geom_point(col=my_cols)
p
library(ggrepel)
p + geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7) +
theme(legend.position = "none") +
labs(title="Halloween Candy PCA Space",
subtitle="Colored by type: bar (dark purple), other chocolate (brown), fruity (orange), other (black)",
caption="Data from 538")
## Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p)
par(mar=c(8,4,2,2))
barplot(pca$rotation[,1], las=2, ylab="PC1 Contribution")
Q24. What original variables are picked up strongly by PC1 in the positive direction? Do these make sense to you?
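A24. In the positive direction, PC1 picks up fruity, hard, and pluribus. In the negative direction it picks up chocolate, caramel, peanutyalmondy, nougat, crispedricewafer, bar, sugarpercent, pricepercent, and winpercent. This makes sense: fruity candies are often hard and sold as multiple pieces, whereas chocolate candies are often bars.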
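The loadings in `pca$rotation` are the weights that turn scaled variables into PC scores, which is why variables with same-signed loadings pull candies toward the same side of the score plot. A sketch on a six-row, three-column toy subset from the `head()` output (the `toy`, `toy_pca`, and `manual_scores` names are illustrative, and the toy loadings will differ from those of the full dataset):

```r
toy <- data.frame(
  chocolate  = c(1, 1, 0, 0, 0, 1),
  fruity     = c(0, 0, 0, 0, 1, 0),
  winpercent = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.34755),
  row.names  = c("100 Grand", "3 Musketeers", "One dime",
                 "One quarter", "Air Heads", "Almond Joy")
)

toy_pca <- prcomp(toy, scale. = TRUE)

# Loadings: one weight per original variable for each component.
print(round(toy_pca$rotation[, "PC1"], 2))

# Scores are the centered, scaled data multiplied by the loadings.
manual_scores <- scale(toy) %*% toy_pca$rotation

# The manual scores match what prcomp() stores in $x.
print(isTRUE(all.equal(unname(manual_scores), unname(toy_pca$x))))
```

The overall sign of a component is arbitrary, so what matters is which variables share a sign, not whether that sign is positive or negative.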